CONCENTRATION BOUNDS FOR ENTROPY ESTIMATION OF 
ONE-DIMENSIONAL GIBBS MEASURES 



J.-R. CHAZOTTES, C. MALDONADO 



Abstract. We obtain bounds on fluctuations of two entropy estimators for a class 
of one-dimensional Gibbs measures on the full shift. They are the consequence of a 
general exponential inequality for Lipschitz functions of n variables. The first estimator 
is based on empirical frequencies of blocks scaling logarithmically with the sample 
length. The second one is based on the first appearance of blocks within typical 
samples. 
1. Introduction

Given a 'sample' $x_0, x_1, \ldots$ of a finite-valued discrete-time ergodic process
$\{X_n; n \in \mathbb{N}\}$, there are several ways to consistently estimate its entropy.
In this paper we shall study two estimators. One is based on empirical frequencies of
blocks and is referred to as the 'plug-in' estimator. The other one is based on the first
appearance or repetition of blocks within the sample. We refer to [13] for their basic
properties. Here we are concerned with the fluctuation properties of these estimators.
We will further assume that the joint distribution of the process
$\{X_n; n \in \mathbb{N}\}$ is Gibbsian in a way made precise below.

Fluctuations of the plug-in estimator were already studied in [10] and [5] from the
viewpoint of classical limit theorems. Namely, in [10] the authors prove a central limit
theorem and in [5] a large deviation principle is obtained.

Regarding the return-time and the hitting-time estimators, previous results are found
in [7] and [6]. Central limit theorems and large deviation principles are established in
these papers. In the present article, we only study the hitting-time estimator.

Our aim is to obtain bounds on the fluctuations of the plug-in and hitting-time
estimators in the spirit of concentration inequalities. Concentration inequalities have
recently become a widespread and powerful tool in many fields of pure and applied
probability, as well as in functional analysis, combinatorics, computer science, etc.; see
for instance [11] and [9]. In the context of dynamical systems, the first result was proved
in [8] where several applications are presented (see also [1]). Namely, an exponential
inequality is proved for any separately Lipschitz function of $n$ variables for a class of
piecewise expanding maps of the interval. In our setting the same inequality holds. The
proof is the same as in [8]. It is in fact simpler since no Markov partition is assumed
therein. In this paper we apply this exponential inequality to get some fluctuation bounds
on our entropy estimators. The only previous work where this is done for the plug-in
estimator is found in [2] in the case where the $X_i$'s are independent identically
distributed random variables taking values in a countable set. For the hitting-time
estimator, no such bounds were known before, even in the case of independent random
variables.

Our main results are Theorems 4.1, 4.2 and 4.3. Let us emphasize that we establish
bounds for every $n$, $n$ being the sample length, whereas the results obtained in
[5], [6], [7], [10] are in some sense finer but they are only asymptotic. Therefore, our
work complements the picture on the fluctuations of these entropy estimators. Theorems
4.1 and 4.2 concern the plug-in estimator. Theorem 4.3 is about the hitting-time
estimator. It should be noted that, for the hitting-time estimator, the route to get the
bounds is not as direct as for the plug-in estimator because the hitting time a priori
behaves badly. The trick is to take advantage of its approximation by the inverse measure
of the corresponding cylinder. This is where Gibbsianness is crucial.

This paper is organized as follows. In Section 2 we recall some definitions and facts,
and state the exponential bound from which concentration inequalities follow. Section
3 contains our results on the plug-in estimators and the hitting-time estimator.

2. Setting 

After fixing some notations, we recall a few facts about entropy and Gibbs measures. 

2.1. Notations and definitions. We consider the set $\Omega = A^{\mathbb{N}}$ of infinite
sequences $x$ of symbols from the finite set $A$: $x = x_0, x_1, \ldots$ where
$x_j \in A$. We denote by $\sigma$ the shift map on $\Omega$:
$(\sigma x)_i = x_{i+1}$, for all $i = 0, 1, \ldots$.

We equip $\Omega$ with the usual distance: fix $\theta \in (0, 1)$ and for $x \neq y$, let
$d_\theta(x, y) = \theta^N$ where $N$ is the largest nonnegative integer with
$x_i = y_i$ for every $0 \le i < N$. (By convention, if $x = y$ then $N = \infty$ and
$\theta^\infty = 0$, while if $x_0 \neq y_0$ then $N = 0$.) With this distance $\Omega$
is a compact metric space.
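
As a concrete illustration of this distance, here is a minimal Python sketch that evaluates $d_\theta$ on finite prefixes of two sequences. The function name `d_theta` and the use of finite tuples or strings in place of infinite sequences are assumptions made for illustration only; when the two prefixes agree entirely, the value returned is merely an upper bound on the true distance.

```python
def d_theta(x, y, theta=0.5):
    """Return theta**N, with N the length of the longest common prefix of x and y.

    x and y are finite prefixes (hypothetical stand-ins for infinite sequences).
    If the prefixes agree entirely, theta**(common length) is returned, which is
    only an upper bound on the distance between the underlying infinite sequences.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie in (0, 1)")
    n_common = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n_common += 1
    return theta ** n_common
```

For instance, two sequences that first differ at index 2 are at distance $\theta^2$, and two sequences with different first symbols are at distance 1.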

For a given string $a_0^{k-1} = a_0, \ldots, a_{k-1}$ ($a_i \in A$), the set
$[a_0^{k-1}] = \{x \in \Omega : x_i = a_i,\ i = 0, 1, \ldots, k-1\}$ is the cylinder of
length $k$ based on $a_0, \ldots, a_{k-1}$.
For a continuous function $f : \Omega \to \mathbb{R}$ and $m \ge 0$ we define

\[
\mathrm{var}_m(f) := \sup\{|f(x) - f(y)| : x_i = y_i,\ i = 0, \ldots, m\}.
\]

It is easy to see that $|f(x) - f(y)| \le C d_\theta(x, y)$ if and only if
$\mathrm{var}_m(f) \le C\theta^m$, $m = 0, 1, \ldots$.
Let

\[
\mathcal{F}_\theta = \{f : f \text{ continuous},\ \mathrm{var}_m(f) \le C\theta^m,\ m = 0, 1, \ldots, \text{ for some } C > 0\}.
\]

This is the space of Lipschitz functions with respect to the distance $d_\theta$. For
$f \in \mathcal{F}_\theta$ let
$|f|_\theta = \sup\{\mathrm{var}_m(f)/\theta^m : m \ge 0\}$. We notice that $|f|_\theta$
is merely the least Lipschitz constant of $f$.

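2.2. Entropy. Let $\nu$ be a shift-invariant probability measure on $\Omega$ and

\[
H_k(\nu) = -\sum_{a_0^{k-1} \in A^k} \nu([a_0^{k-1}]) \log \nu([a_0^{k-1}])
\]

its '$k$-block entropy'. Then the entropy of $\nu$ is

\[
h(\nu) = \lim_{k \to \infty} \frac{H_k(\nu)}{k}.
\]

Recall that $0 \le h(\nu) \le \log |A|$.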

2.3. Gibbs measures. Full details for this section can be found in [3]. Let
$\phi \in \mathcal{F}_\theta$ and $\mu_\phi$ the associated Gibbs measure. It is the unique
shift-invariant probability measure for which one can find constants
$C = C(\phi) > 1$ and $P = P(\phi)$ such that

\[
(1) \qquad C^{-1} \le \frac{\mu_\phi(\{y : y_i = x_i,\ \forall i \in [0, m)\})}{\exp\bigl(-Pm + \sum_{k=0}^{m-1} \phi(\sigma^k x)\bigr)} \le C
\]

for every $x \in \Omega$ and $m \ge 1$. The constant $P$ is the topological pressure of
$\phi$. We can always assume that $P = 0$ by considering the potential $\phi - P$, which
yields the same Gibbs measure.
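
To make the Gibbs property concrete, the following Python sketch checks it by brute force in the simplest non-trivial case. It assumes (this example is not taken from the paper) a stationary Markov chain with positive transition matrix `P` and stationary vector `pi`, and the potential $\phi(x) = \log P(x_0, x_1)$, for which the pressure is zero. The ratio appearing in the Gibbs inequality then equals $\pi(x_0)/P(x_{m-1}, x_m)$, so it stays between two constants independent of $m$. The function name `gibbs_ratio_range` is hypothetical.

```python
from itertools import product
import math

def gibbs_ratio_range(P, pi, m):
    """Min and max over x of mu([x_0..x_{m-1}]) / exp(sum_{k<m} phi(sigma^k x)).

    Assumptions (illustration only): mu is the stationary Markov measure with
    transition matrix P and stationary vector pi, and phi(x) = log P[x_0][x_1].
    """
    states = range(len(pi))
    ratios = []
    for word in product(states, repeat=m + 1):  # x_0, ..., x_m
        # Measure of the cylinder [x_0 ... x_{m-1}] under the Markov measure.
        cyl = pi[word[0]] * math.prod(P[word[i]][word[i + 1]] for i in range(m - 1))
        # exp of the Birkhoff sum phi(x) + ... + phi(sigma^{m-1} x).
        weight = math.prod(P[word[k]][word[k + 1]] for k in range(m))
        ratios.append(cyl / weight)
    return min(ratios), max(ratios)
```

In this toy case one may take for $C$ the larger of the maximum and the inverse of the minimum returned by the function.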

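The Gibbs measure $\mu_\phi$ satisfies the variational principle, namely

\[
\sup\Bigl\{h(\eta) + \int \phi \, d\eta : \eta \text{ shift-invariant}\Bigr\} = h(\mu_\phi) + \int \phi \, d\mu_\phi = P = 0.
\]

More precisely, $\mu_\phi$ is the unique shift-invariant measure reaching this supremum. In
particular we have

\[
(2) \qquad h(\mu_\phi) = -\int \phi \, d\mu_\phi.
\]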

3. An exponential inequality and its general consequences

3.1. An exponential inequality. Our main tool is an exponential inequality for fairly 
general observables. 

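Let $K : \Omega^n \to \mathbb{R}$ be a function of $n$ variables and, for each
$j = 0, \ldots, n-1$, let

\[
\mathrm{Lip}_j(K) = \sup_{x^{(0)}, \ldots, x^{(n-1)}} \ \sup_{x^{(j)} \neq y^{(j)}}
\frac{\bigl|K(\ldots, x^{(j-1)}, x^{(j)}, x^{(j+1)}, \ldots) - K(\ldots, x^{(j-1)}, y^{(j)}, x^{(j+1)}, \ldots)\bigr|}{d_\theta(x^{(j)}, y^{(j)})}.
\]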




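We shall say that $K$ is a separately Lipschitz function of $n$ variables if

\[
\mathrm{Lip}_j(K) < \infty, \quad j = 0, \ldots, n-1.
\]

We now present our main tool.

Theorem 3.1 ([8]). Let $\mu_\phi$ be a Gibbs measure. Then there exists a constant
$D = D(\phi) > 0$ such that, for any integer $n \ge 1$ and for any separately Lipschitz
function $K$ of $n$ variables, one has

\[
(3) \qquad \int \exp\Bigl(K(x, \ldots, \sigma^{n-1}x) - \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y)\Bigr) d\mu_\phi(x)
\le \exp\Bigl(D \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2\Bigr).
\]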



Let us emphasize that the constant $D$ only depends on $\phi$. It depends neither on $K$
nor on $n$.

The power of (3) lies in that it applies to any separately Lipschitz function of $n$
variables, regardless of its complicated or implicit form. All we have to do is to estimate
its Lipschitz constants.

The proof of Theorem 3.1 easily follows from [8] where it is done for a class of
piecewise expanding maps of the interval (without assuming a Markov partition). In
fact, the proof becomes simpler in our setting. It relies on the fact that the transfer
operator associated to $\phi$ has a spectral gap when acting on Lipschitz functions. The
point is to write an observable of $n$ variables as a telescopic sum of observables depending
only on one variable.

3.2. General consequences. We derive several consequences of inequality (3).
The first one is a bound for the probability of $K$ to deviate from its expectation.

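Corollary 3.1. For every $t > 0$, one has

\[
(4) \qquad \mu_\phi\Bigl\{x : K(x, \ldots, \sigma^{n-1}x) > \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y) + t\Bigr\}
\le \exp\Bigl(-\frac{t^2}{4D \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2}\Bigr).
\]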

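Proof. The proof is an immediate consequence of Markov's inequality and (3): for every
$\lambda > 0$, the function $\lambda K$ is separately Lipschitz and

\begin{align*}
\mu_\phi&\Bigl\{x : K(x, \ldots, \sigma^{n-1}x) > \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y) + t\Bigr\} \\
&\le e^{-\lambda t} \int \exp\Bigl(\lambda \Bigl[K(x, \ldots, \sigma^{n-1}x) - \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y)\Bigr]\Bigr) d\mu_\phi(x) \\
&\le \exp\Bigl(-\lambda t + \lambda^2 D \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2\Bigr).
\end{align*}

It remains to optimize over $\lambda$ to get the desired inequality. □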
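
For the reader's convenience, the optimization step can be written out. The shorthand $S := \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2$ is introduced here only for this computation; the exponent obtained from Markov's inequality is a quadratic polynomial in the free parameter $\lambda$:

```latex
\[
  \frac{d}{d\lambda}\bigl(-\lambda t + \lambda^2 D S\bigr) = -t + 2\lambda D S = 0
  \quad\Longleftrightarrow\quad
  \lambda^\ast = \frac{t}{2DS},
\]
\[
  -\lambda^\ast t + (\lambda^\ast)^2 D S
  = -\frac{t^2}{2DS} + \frac{t^2}{4DS}
  = -\frac{t^2}{4DS}.
\]
```

Since $\lambda^\ast > 0$ whenever $t > 0$, this choice is admissible and yields the Gaussian-type bound $\exp(-t^2/(4DS))$.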
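Of course we can apply (4) to $-K$ and get by a union bound that

\[
\mu_\phi\Bigl\{x : \Bigl|K(x, \ldots, \sigma^{n-1}x) - \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y)\Bigr| > t\Bigr\}
\le 2 \exp\Bigl(-\frac{t^2}{4D \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2}\Bigr)
\]

for every $t > 0$.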

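Another immediate consequence of (3) is a bound on the variance of $K$:

Corollary 3.2. One has

\[
\int \Bigl(K(x, \ldots, \sigma^{n-1}x) - \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y)\Bigr)^2 d\mu_\phi(x)
\le 2D \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2.
\]

Proof. We apply (3) to $\lambda K$, with $\lambda \neq 0$, to get at once

\[
\int \exp\Bigl(\lambda \Bigl[K(x, \ldots, \sigma^{n-1}x) - \int K(y, \ldots, \sigma^{n-1}y) \, d\mu_\phi(y)\Bigr]\Bigr) d\mu_\phi(x)
\le \exp\Bigl(\lambda^2 D \sum_{j=0}^{n-1} \mathrm{Lip}_j(K)^2\Bigr).
\]

The result follows by Taylor expansion and letting $\lambda$ go to 0. □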

The simplest, yet non-trivial, application of the above results is to ergodic sums, that
is to take $K_0(x^{(0)}, x^{(1)}, \ldots, x^{(n-1)}) = f(x^{(0)}) + f(x^{(1)}) + \cdots + f(x^{(n-1)})$
where $f : \Omega \to \mathbb{R}$ is Lipschitz. A particular case of Corollary 3.1 yields
immediately the following result, stated for later convenience.

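Corollary 3.3. Let $f : \Omega \to \mathbb{R}$ be a Lipschitz function. Then

\[
(5) \qquad \mu_\phi\Bigl\{x : \frac{1}{n}\bigl(f(x) + \cdots + f(\sigma^{n-1}x)\bigr) - \int f \, d\mu_\phi > t\Bigr\} \le e^{-Bnt^2}
\]

for every $t > 0$ and for every $n \ge 1$, where $B := (4D|f|_\theta^2)^{-1}$.
We can of course apply (5) to $-f$ to get

\[
(6) \qquad \mu_\phi\Bigl\{x : \frac{1}{n}\bigl(f(x) + \cdots + f(\sigma^{n-1}x)\bigr) - \int f \, d\mu_\phi < -t\Bigr\} \le e^{-Bnt^2}.
\]

By a union bound, (5) and (6) yield at once

\[
\mu_\phi\Bigl\{x : \Bigl|\frac{1}{n}\bigl(f(x) + \cdots + f(\sigma^{n-1}x)\bigr) - \int f \, d\mu_\phi\Bigr| > t\Bigr\} \le 2e^{-Bnt^2}.
\]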



In words, the ergodic average of $f$ concentrates sharply around its $\mu_\phi$-average. The
bound is exponentially small in $n$ and when $t$ gets large, the probability of deviation is
extremely small.
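
To get a feeling for the two-sided bound $2e^{-Bnt^2}$, here is a small Python sketch that evaluates it and inverts it to obtain a sufficient sample length. The numerical values of `D` and `lip` (standing for $D$ and $|f|_\theta$) are hypothetical inputs: no explicit value of $D$ is given in the text, so the numbers produced are purely illustrative.

```python
import math

def deviation_bound(n, t, D, lip):
    """Two-sided bound 2*exp(-B*n*t**2) with B = 1/(4*D*lip**2), capped at 1."""
    B = 1.0 / (4.0 * D * lip ** 2)
    return min(1.0, 2.0 * math.exp(-B * n * t ** 2))

def sample_size_for(t, eps, D, lip):
    """Smallest n such that 2*exp(-B*n*t**2) <= eps (requires 0 < eps < 2)."""
    if not 0.0 < eps < 2.0:
        raise ValueError("eps must lie in (0, 2)")
    B = 1.0 / (4.0 * D * lip ** 2)
    return math.ceil(math.log(2.0 / eps) / (B * t ** 2))
```

The second function shows that, for fixed precision $t$ and confidence level, the required sample length grows like $t^{-2} \log(1/\varepsilon)$.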

Let us close this section by a basic observation. Many estimators of interest are
functions of $n$ symbols, that is, functions of the form $K : A^n \to \mathbb{R}$. A
function $K : A^n \to \mathbb{R}$ can be identified with a function
$K : \Omega^n \to \mathbb{R}$. When applying Theorem 3.1 and its corollaries in this
special case, $\mathrm{Lip}_j(K)$ has to be replaced by $\delta_j(K)$, the oscillation at
the $j$-th coordinate, where

\[
(7) \qquad \delta_j(K) = \sup_{a_0, \ldots, a_{n-1}} \ \sup_{a_j \neq b_j}
\bigl|K(a_0, \ldots, a_{j-1}, a_j, a_{j+1}, \ldots, a_{n-1}) - K(a_0, \ldots, a_{j-1}, b_j, a_{j+1}, \ldots, a_{n-1})\bigr|.
\]
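
For small alphabets and short words, the oscillation at a coordinate can be computed by exhaustive search, which is a convenient sanity check when bounding it by hand. The Python sketch below is such a brute-force computation; the function name `oscillation` and its signature are hypothetical, and its cost grows like $|A|^{n+1}$, so it is only meant for toy cases.

```python
from itertools import product

def oscillation(K, alphabet, n, j):
    """Brute-force delta_j(K): largest change of K when only coordinate j is modified.

    K takes a tuple of n symbols and returns a float.
    """
    best = 0.0
    for word in product(alphabet, repeat=n):
        base = K(word)
        for b in alphabet:
            if b == word[j]:
                continue
            changed = word[:j] + (b,) + word[j + 1:]
            best = max(best, abs(base - K(changed)))
    return best
```

For the empirical mean of $n$ binary symbols, every oscillation equals $1/n$, which is the familiar bounded-differences situation.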

4. Bounds on entropy estimators 
Throughout this section, $\phi \in \mathcal{F}_\theta$ and $\mu_\phi$ is its unique Gibbs measure.

4.1. Plug-in estimator. The plug-in estimator is based on the empirical frequency of
a word $a_0^{k-1}$ in a 'sample' $x_0, x_1, \ldots, x_{n-1}$:

\[
\mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) = \frac{1}{n} \#\{0 \le j < n : \tilde{x}_j^{j+k-1} = a_0^{k-1}\},
\]

where $\tilde{x} := x_0^{n-1} x_0^{n-1} \cdots$ is the periodic point with period $n$ made
from $x_0^{n-1}$. This trick makes $\mathcal{E}_k(\cdot; x_0^{n-1})$ a locally
shift-invariant probability measure on $A^k$.
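
The periodic extension is easy to implement. The following Python sketch computes the empirical frequencies of all $k$-blocks read cyclically from a finite sample; the function name `empirical_block_frequencies` is hypothetical, and the sample may be any finite sequence of hashable symbols (a string, a list, a tuple).

```python
from collections import Counter

def empirical_block_frequencies(sample, k):
    """Return {block: frequency} for the k-blocks read cyclically from sample."""
    n = len(sample)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(sample)")
    # The first n + k - 1 symbols of the periodic point are enough to read
    # the n blocks starting at positions 0, ..., n - 1.
    periodic = tuple(sample) + tuple(sample[:k - 1])
    counts = Counter(periodic[j:j + k] for j in range(n))
    return {block: c / n for block, c in counts.items()}
```

Because exactly $n$ blocks are counted, the frequencies always sum to one, and the wrap-around is what makes the two marginals of the $k$-block distribution coincide.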

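For any ergodic measure $\nu$, there is a set of $\nu$-measure one of $x$'s such that for
every $k \ge 1$ and every $a_0^{k-1} \in A^k$

\[
\lim_{n \to \infty} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) = \nu([a_0^{k-1}]).
\]

The $k$-block empirical entropy is defined as

\[
\hat{H}_k(x_0^{n-1}) := H_k(\mathcal{E}_k(\cdot; x_0^{n-1})).
\]

It is clear that for $\nu$-almost every $x$

\[
\lim_{k \to \infty} \lim_{n \to \infty} \frac{\hat{H}_k(x_0^{n-1})}{k} = h(\nu).
\]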

As shown by Ornstein and Weiss (see [13]), we can in fact take a single limit by letting
$k$ depend on $n$: if $k(n) \to \infty$ and $k(n) \le \frac{1}{h(\nu)} \log n$ then

\[
\lim_{n \to \infty} \frac{\hat{H}_{k(n)}(x_0^{n-1})}{k(n)} = h(\nu) \quad \text{for } \nu\text{-almost every } x.
\]

Note that since $h(\nu) \le \log |A|$ we can always take $k(n) \le \frac{1}{\log |A|} \log n$.
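
A minimal Python sketch of the resulting plug-in estimator is given below. It reads blocks cyclically, as above, and lets the block length grow logarithmically with the sample length. The function names and the constant `c` in the rule $k = \lfloor c \log n / \log |A| \rfloor$ are assumptions made for illustration (any admissible logarithmic growth could be substituted), not a prescription from the text.

```python
import math
from collections import Counter

def block_entropy(sample, k):
    """Entropy (natural log) of the cyclic empirical distribution of k-blocks."""
    n = len(sample)
    ext = tuple(sample) + tuple(sample[:k - 1])
    counts = Counter(ext[j:j + k] for j in range(n))
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def plug_in_entropy(sample, alphabet_size, c=0.5):
    """Return (estimate, k) with k = max(1, floor(c*log(n)/log(alphabet_size))).

    The constant c is a hypothetical tuning parameter.
    """
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    n = len(sample)
    k = max(1, int(c * math.log(n) / math.log(alphabet_size)))
    return block_entropy(sample, k) / k, k
```

On a sample of period two such as `"01" * 50`, only two blocks of any given length occur, so the estimate is $\log 2 / k$, which tends to the true entropy zero as $k$ grows.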

We can formulate our first result on fluctuations of the plug-in entropy estimator. We
denote by $\mathbb{E}$ the expectation and by $\mathrm{Var}$ the variance under $\mu_\phi$.

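Theorem 4.1. Let $D$ be the constant appearing in (3). For every $\alpha \in (0, 1)$,
$t > 0$ and $n \ge 2$ one has

\[
\mu_\phi\Bigl\{\Bigl|\frac{\hat{H}_{k(n)}}{k(n)} - \mathbb{E}\Bigl(\frac{\hat{H}_{k(n)}}{k(n)}\Bigr)\Bigr| > t\Bigr\}
\le 2 \exp\Bigl(-\frac{n^{1-\alpha} t^2}{16 D (\log n)^2}\Bigr)
\]

provided that $k(n) \le \frac{\alpha}{2 \log |A|} \log n$.
Moreover for every $n \ge 2$

\[
\mathrm{Var}\Bigl(\frac{\hat{H}_{k(n)}}{k(n)}\Bigr) \le 8D \, \frac{(\log n)^2}{n^{1-\alpha}}.
\]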
Proof. Given any integer $k \ge 1$, consider the function $K : A^n \to \mathbb{R}$ defined by

\[
K(s_0, \ldots, s_{n-1}) = \hat{H}_k(s_0^{n-1}).
\]

We estimate the $\delta_j(K)$'s (see (7) for the definition of $\delta_j(\cdot)$). We claim that

\[
\delta_j(K) \le 2k|A|^k \frac{\log n}{n}, \quad \forall j = 0, \ldots, n-1.
\]

Indeed, given any string $a_0^{k-1}$, the change of one symbol in $s_0^{n-1}$ can decrease
$\mathcal{E}_k(a_0^{k-1}; s_0^{n-1})$ by at most $k/n$. It is possible that another string
gets its frequency increased, and this can be at most by $k/n$. This is the worst case. We
then use the fact that for any pair of positive integers $l$ and $k$ such that
$l + k \le n$, one has

\[
\Bigl|\frac{l}{n} \log\Bigl(\frac{l}{n}\Bigr) - \frac{l+k}{n} \log\Bigl(\frac{l+k}{n}\Bigr)\Bigr| \le \frac{k}{n} \log n.
\]

The claim follows by summing up this bound for all strings, which gives the factor $|A|^k$.
Finally, taking $k(n) \le \frac{\alpha}{2 \log |A|} \log n$ with $\alpha \in (0, 1)$, and
applying Corollaries 3.1 and 3.2 we get the desired bounds. □
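
The arithmetic leading to the constants of the theorem can be spelled out as follows. We assume here that the oscillation bound claimed in the proof reads $\delta_j(K) \le 2k|A|^k (\log n)/n$, and that the variance bound of Corollary 3.2 has the form $\mathrm{Var} \le 2D \sum_j \delta_j^2$; under these assumptions, dividing $K$ by $k$ gives:

```latex
\[
  \sum_{j=0}^{n-1} \delta_j\Bigl(\frac{K}{k}\Bigr)^2
  \le n \Bigl(\frac{2|A|^{k}\log n}{n}\Bigr)^2
  = \frac{4|A|^{2k}(\log n)^2}{n}
  \le \frac{4(\log n)^2}{n^{1-\alpha}},
  \qquad\text{since } |A|^{2k} \le n^{\alpha}
  \text{ when } k \le \frac{\alpha}{2\log|A|}\log n .
\]
\[
  \exp\Bigl(-\frac{t^2}{4D\sum_j \delta_j(K/k)^2}\Bigr)
  \le \exp\Bigl(-\frac{n^{1-\alpha}t^2}{16D(\log n)^2}\Bigr),
  \qquad
  2D\sum_j \delta_j\Bigl(\frac{K}{k}\Bigr)^2 \le \frac{8D(\log n)^2}{n^{1-\alpha}} .
\]
```

The first display is the only place where the restriction on $k(n)$ enters: it keeps the number $|A|^{k}$ of possible blocks small compared with the sample length.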

It is natural to seek a concentration bound for the empirical entropy not about its
expectation, but about $h(\mu_\phi)$, the entropy of the Gibbs measure. To have good control
on this expectation, it turns out that a better estimator is the conditional empirical
entropy. To define it, we need to recall a few definitions and facts.

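For a shift-invariant measure $\nu$ and $k \ge 2$, let

\[
h_k(\nu) = H_k(\nu) - H_{k-1}(\nu).
\]

It is well known that $\lim_{k \to \infty} h_k(\nu) = h(\nu)$.
The $k$-block conditional empirical entropy is

\[
\hat{h}_k(x_0^{n-1}) := \hat{H}_k(x_0^{n-1}) - \hat{H}_{k-1}(x_0^{n-1}).
\]

When $\nu$ is ergodic, one can prove [13] that, if $k(n) \to \infty$ and
$k(n) \le \frac{1-\epsilon}{\log |A|} \log n$, for any $\epsilon \in (0, 1)$, then

\[
\lim_{n \to \infty} \hat{h}_{k(n)}(x_0^{n-1}) = h(\nu), \quad \text{for } \nu\text{-almost every } x.
\]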
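
The conditional empirical entropy is obtained from the block entropies by a simple difference, as in the following Python sketch. It assumes blocks are read cyclically from the sample (periodic extension); the function names are hypothetical.

```python
import math
from collections import Counter

def block_entropy(sample, k):
    """Entropy (natural log) of the cyclic empirical distribution of k-blocks."""
    n = len(sample)
    ext = tuple(sample) + tuple(sample[:k - 1])
    counts = Counter(ext[j:j + k] for j in range(n))
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def conditional_empirical_entropy(sample, k):
    """Difference of the k-block and (k-1)-block empirical entropies, for k >= 2."""
    if k < 2:
        raise ValueError("k must be at least 2")
    return block_entropy(sample, k) - block_entropy(sample, k - 1)
```

On a periodic sample such as `"01" * 50` the difference vanishes already for small $k$, whereas the normalized block entropy only decays like $1/k$; this is one way to see why the conditional version has a smaller bias.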
We have the following result. 

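Theorem 4.2. Assume that $\theta < |A|^{-1}$. There exist strictly positive constants
$c, \gamma, \Gamma, \xi$ such that for every $t > 0$ and for every $n$ large enough

\[
\mu_\phi\Bigl\{\bigl|\hat{h}_{k(n)} - h(\mu_\phi)\bigr| > t + \frac{c}{n^\gamma}\Bigr\}
\le 2 \exp\Bigl(-\frac{\Gamma n^\xi t^2}{(\log n)^4}\Bigr)
\]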

provided that $k(n) \le \frac{q \log n}{\log |A|}$, where $q \in (0, 1/2)$ is the constant fixed in the proof.

Remark 4.1. From the proof we have $\gamma = 1/\bigl(1 + \frac{\log |A|}{\log(\theta^{-1})}\bigr)$,
$\xi = 1 - 2/\bigl(1 + \frac{\log(\theta^{-1})}{\log |A|}\bigr)$ and
$\Gamma = (\log |A|)^2 / 16D$.

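Proof. By definition $\hat{h}_k = \hat{H}_k - \hat{H}_{k-1}$. If we let
$K'(s_0, \ldots, s_{n-1}) = \hat{h}_k(s_0^{n-1})$, we estimate the oscillations
$\delta_j(K')$ as in the proof of Theorem 4.1.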

We now estimate the expectation of $\hat{h}_{k(n)}$. We need the following lemma.

Lemma 4.1. We have

\[
(8) \qquad \hat{h}_{k(n)}(x_0^{n-1}) = \frac{1}{n} \sum_{i=0}^{n-1} \bigl(-\phi(\sigma^i x)\bigr) + \Delta_{k(n)}(x_0^{n-1}) + O(\theta^{k(n)})
\]

where

\[
(9) \qquad |\mathbb{E}(\Delta_{k(n)})| \le \frac{M |A|^{k(n)}}{n}
\]

for some constant $M > 0$.

This lemma can be deduced from the proof of Theorem 2.1 in [10]. However, for the
reader's convenience, we provide part of its proof in the appendix.

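Now subtract $h(\mu_\phi)$ and take the expectation on both sides of (8), to get, using (2),

\[
\bigl|\mathbb{E}(\hat{h}_{k(n)}) - h(\mu_\phi)\bigr| \le \frac{M |A|^{k(n)}}{n} + O(\theta^{k(n)}).
\]

We now take $k(n) = q \log n / \log |A|$, where $0 < q < 1$ has to be determined. Choosing
$q = 1/\bigl(1 + \frac{\log(\theta^{-1})}{\log |A|}\bigr)$, we easily get that

\[
(10) \qquad \bigl|\mathbb{E}(\hat{h}_{k(n)}) - h(\mu_\phi)\bigr| \le \frac{c}{n^\gamma},
\]

where $c > 0$ is some constant and $\gamma = 1/\bigl(1 + \frac{\log |A|}{\log(\theta^{-1})}\bigr)$.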

To end the proof, we apply Corollary 3.1 and use (10). For the exponent $\xi$ in the
statement of the theorem to be strictly positive, one must have $q < 1/2$, which is
equivalent to the requirement that $\theta < |A|^{-1}$. □
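
The choice of the exponents can be checked by a short computation. We assume, as suggested by Lemma 4.1, that the bias of the estimator is of order $|A|^{k(n)}/n + \theta^{k(n)}$, and we write $k(n) = q \log n / \log|A|$ with a free parameter $q \in (0,1)$; balancing the two contributions gives:

```latex
\[
  \frac{|A|^{k(n)}}{n} = n^{q-1},
  \qquad
  \theta^{k(n)} = n^{-q\,\log(\theta^{-1})/\log|A|},
\]
\[
  1 - q = q\,\frac{\log(\theta^{-1})}{\log|A|}
  \;\Longleftrightarrow\;
  q = \frac{1}{1 + \frac{\log(\theta^{-1})}{\log|A|}},
  \qquad
  \gamma = 1 - q = \frac{1}{1 + \frac{\log|A|}{\log(\theta^{-1})}},
\]
\[
  q < \tfrac12
  \;\Longleftrightarrow\;
  \log(\theta^{-1}) > \log|A|
  \;\Longleftrightarrow\;
  \theta < |A|^{-1}.
\]
```

With this $q$, the sum of the squared oscillations is of order $n^{2q-1}$ up to logarithmic factors, so that $\xi = 1 - 2q$ is positive exactly under the assumption on $\theta$.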

4.2. Hitting times. Given $x, y \in \Omega$, let

\[
W_n(x, y) = \inf\{j \ge 1 : y_j^{j+n-1} = x_0^{n-1}\}.
\]

Under suitable mixing conditions on the shift-invariant measure $\nu$ one can prove [13]
that

\[
\lim_{n \to \infty} \frac{1}{n} \log W_n(x, y) = h(\nu), \quad \text{for } \nu \otimes \nu\text{-almost every } (x, y).
\]

In particular, when $\nu$ is a Gibbs measure in the above sense, this result holds true [B].
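
For illustration, the hitting time and the corresponding entropy estimate can be computed on finite samples as in the Python sketch below. Working with a finite portion of $y$ is an assumption of the sketch: when the pattern is not found in the available portion, the functions return `None` rather than a value. The function names are hypothetical.

```python
import math

def hitting_time(x, y, n):
    """First j >= 1 such that y[j:j+n] equals x[:n], or None if not found in y."""
    if len(x) < n:
        raise ValueError("x must contain at least n symbols")
    target = tuple(x[:n])
    for j in range(1, len(y) - n + 1):
        if tuple(y[j:j + n]) == target:
            return j
    return None

def hitting_time_entropy(x, y, n):
    """Return log(W_n(x, y)) / n, or None when the hitting time is not observed."""
    w = hitting_time(x, y, n)
    return None if w is None else math.log(w) / n
```

Since a typical hitting time is of order $e^{n h(\nu)}$, the available portion of $y$ has to be exponentially long in $n$ for the pattern to be observed with high probability.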
We have the following concentration bounds for the hitting-time estimator. 

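Theorem 4.3. There exist constants $C_1, C_2 > 0$ and $t_0 > 0$ such that, for every
$n \ge 1$ and every $t > t_0$,

\[
(11) \qquad (\mu_\phi \otimes \mu_\phi)\Bigl\{(x, y) : \frac{1}{n} \log W_n(x, y) > h(\mu_\phi) + t\Bigr\} \le C_1 e^{-C_2 n t^2}
\]

and

\[
(12) \qquad (\mu_\phi \otimes \mu_\phi)\Bigl\{(x, y) : \frac{1}{n} \log W_n(x, y) < h(\mu_\phi) - t\Bigr\} \le C_1 e^{-C_2 n t}.
\]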



Let us notice that the upper tail estimate behaves differently than the lower tail
estimate as a function of $t$. This asymmetric behavior also shows up in the large
deviation asymptotics [6].

Let us sketch the strategy to prove Theorem 4.3. We cannot apply directly our
concentration inequality to the random variable $W_n$ for the following basic reason. Given
$x$ and $y$, the first time that one sees the first $n$ symbols of $x$ in $y$ is
$W_n(x, y)$; assume it is finite. If we make $y'$ by changing one symbol in $y$, we have
a priori no control on $W_n(x, y')$, which can be arbitrarily larger than $W_n(x, y)$ and
even infinite. Of course, this situation is not typical, but we are forced to use the worst
case to apply our concentration inequality. Roughly, we proceed as follows. We obviously
have $\log W_n = \log\bigl(W_n \mu_\phi([X_0^{n-1}])\bigr) - \log \mu_\phi([X_0^{n-1}])$.
On the one hand, we use a sharp approximation of the law of the random variables
$W_n \mu_\phi([X_0^{n-1}])$ by an exponential law proved in [1]. On the other hand, by the
Gibbs property, $\log \mu_\phi([x_0^{n-1}]) \approx \phi(x) + \cdots + \phi(\sigma^{n-1}x)$
and we can use Corollary 3.3 for $f = \phi$.



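Proof. We first prove (11). We obviously have

\begin{align*}
(\mu_\phi \otimes \mu_\phi)&\Bigl\{(x, y) : \frac{1}{n} \log W_n(x, y) > h(\mu_\phi) + t\Bigr\} \\
&= (\mu_\phi \otimes \mu_\phi)\Bigl\{(x, y) : \frac{1}{n} \log W_n(x, y) + \frac{1}{n} \log \mu_\phi([x_0^{n-1}]) - \frac{1}{n} \log \mu_\phi([x_0^{n-1}]) - h(\mu_\phi) > t\Bigr\} \\
&\le (\mu_\phi \otimes \mu_\phi)\Bigl\{(x, y) : \frac{1}{n} \log\bigl[W_n(x, y) \mu_\phi([x_0^{n-1}])\bigr] > \frac{t}{2}\Bigr\} \\
&\quad + \mu_\phi\Bigl\{x : -\frac{1}{n} \log \mu_\phi([x_0^{n-1}]) - h(\mu_\phi) > \frac{t}{2}\Bigr\} \\
&=: T_1 + T_2.
\end{align*}

We first derive an upper bound for $T_2$.
We use (1), Corollary 3.3 applied to $f = -\phi$ and (2) to get

\[
T_2 \le \mu_\phi\Bigl\{-\frac{1}{n}\bigl(\phi + \cdots + \phi \circ \sigma^{n-1}\bigr) - h(\mu_\phi) > \frac{t}{2} - \frac{1}{n} \log C\Bigr\} \le e^{-Bnt^2}
\]

for every $t$ larger than $2 \log C$.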

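We now derive an upper bound for $T_1$. To this end we apply the following result which
we state as a lemma. It is an immediate consequence of Theorem 1 in [1].

Lemma 4.2 ([1]). Let

\[
\tau_{[a_0^{n-1}]}(y) := \inf\{j \ge 1 : y_j^{j+n-1} = a_0^{n-1}\}.
\]

There exist strictly positive constants $C, c, \Lambda_1, \Lambda_2$, with
$\Lambda_1 \le \Lambda_2$, such that for every $n \in \mathbb{N}$, every string
$a_0^{n-1}$, there exists $\lambda(a_0^{n-1}) \in [\Lambda_1, \Lambda_2]$ such that

\[
\Bigl|\mu_\phi\Bigl\{y : \tau_{[a_0^{n-1}]}(y) > \frac{u}{\lambda(a_0^{n-1}) \mu_\phi([a_0^{n-1}])}\Bigr\} - e^{-u}\Bigr| \le C e^{-cu}
\]

for every $u > 0$.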

By definition and using the previous lemma we get

\[
T_1 = \sum_{a_0^{n-1}} \mu_\phi([a_0^{n-1}]) \, \mu_\phi\bigl\{y : \tau_{[a_0^{n-1}]}(y) \mu_\phi([a_0^{n-1}]) > e^{nt/2}\bigr\}
\le C'' e^{-c' e^{nt/2}}
\]

for some $c', C'' > 0$.

Since the bound for $T_1$ is (much) smaller than the bound for $T_2$, we can bound
$T_1 + T_2$ by a constant times $e^{-Bnt^2}$. This yields (11).
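We now turn to the proof of (12). We have

\begin{align*}
(\mu_\phi \otimes \mu_\phi)&\Bigl\{(x, y) : \frac{1}{n} \log W_n(x, y) < h(\mu_\phi) - t\Bigr\} \\
&= (\mu_\phi \otimes \mu_\phi)\Bigl\{(x, y) : -\frac{1}{n} \log W_n(x, y) - \frac{1}{n} \log \mu_\phi([x_0^{n-1}]) + \frac{1}{n} \log \mu_\phi([x_0^{n-1}]) + h(\mu_\phi) > t\Bigr\} \\
&\le (\mu_\phi \otimes \mu_\phi)\Bigl\{(x, y) : -\frac{1}{n} \log\bigl[W_n(x, y) \mu_\phi([x_0^{n-1}])\bigr] > \frac{t}{2}\Bigr\} \\
&\quad + \mu_\phi\Bigl\{x : \frac{1}{n} \log \mu_\phi([x_0^{n-1}]) + h(\mu_\phi) > \frac{t}{2}\Bigr\} \\
&=: T_1' + T_2'.
\end{align*}

Proceeding as for $T_2$ (applying Corollary 3.3 to $f = \phi$) we obtain the upper bound

\[
T_2' \le \mu_\phi\Bigl\{\frac{1}{n}\bigl(\phi + \cdots + \phi \circ \sigma^{n-1}\bigr) - \int \phi \, d\mu_\phi > \frac{t}{2} - \frac{1}{n} \log C\Bigr\} \le e^{-C'' n t^2}
\]

for some $C'' > 0$ and for every $t > 2 \log C$.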

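To bound $T_1'$ we use the following lemma (Lemma 9 in [1]).

Lemma 4.3 ([1]). For any $v > 0$ and for any $a_0^{n-1}$ such that
$v \mu_\phi([a_0^{n-1}]) \le 1/2$, one has

\[
\Lambda_1 \le \frac{-\log \mu_\phi\{y : \tau_{[a_0^{n-1}]}(y) > v\}}{v \mu_\phi([a_0^{n-1}])} \le \Lambda_2,
\]

where $\Lambda_1, \Lambda_2$ are the constants appearing in Lemma 4.2.
The previous lemma implies that

\[
\mu_\phi\{y : \tau_{[a_0^{n-1}]}(y) \le v\} \le 1 - \exp\bigl(-\Lambda_2 v \mu_\phi([a_0^{n-1}])\bigr) \le \Lambda_2 v \mu_\phi([a_0^{n-1}])
\]

provided that $v \mu_\phi([a_0^{n-1}]) \le 1/2$. Taking
$v = e^{-nt/2} / \mu_\phi([a_0^{n-1}])$ it follows that

\[
T_1' = \sum_{a_0^{n-1}} \mu_\phi([a_0^{n-1}]) \, \mu_\phi\bigl\{y : \tau_{[a_0^{n-1}]}(y) \mu_\phi([a_0^{n-1}]) \le e^{-nt/2}\bigr\}
\le \Lambda_2 e^{-nt/2}.
\]

This inequality holds if $v \mu_\phi([a_0^{n-1}]) = e^{-nt/2} \le 1/2$, which is the case
for any $n \ge 1$ if $t \ge 2 \log 2$.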

Inequality (12) follows from the bound we get for $T_1' + T_2'$. But the bound for $T_1'$
is bigger than the one for $T_2'$, whence the result.

The proof of the theorem is complete. □ 
Appendix A. Proof of Lemma 4.1
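We start with the following identity:

\begin{align*}
\hat{h}_k(x_0^{n-1}) &= -\sum_{a_0^{k-1}} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) \log \frac{\mathcal{E}_k(a_0^{k-1}; x_0^{n-1})}{\mathcal{E}_{k-1}(a_0^{k-2}; x_0^{n-1})} \\
(13) \qquad &= -\sum_{a_0^{k-1}} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) \log \frac{\mu_\phi([a_0^{k-1}])}{\mu_\phi([a_1^{k-1}])} + \Delta_k(x_0^{n-1}),
\end{align*}

where

\begin{align*}
\Delta_k(x_0^{n-1}) &= -\sum_{a_0^{k-1}} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) \log \frac{\mathcal{E}_k(a_0^{k-1}; x_0^{n-1})}{\mu_\phi([a_0^{k-1}])}
+ \sum_{a_0^{k-1}} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) \log \frac{\mathcal{E}_{k-1}(a_0^{k-2}; x_0^{n-1})}{\mu_\phi([a_1^{k-1}])} \\
(14) \qquad &= -H_k(\mathcal{E}_k(\cdot; x_0^{n-1}) | \mu_\phi) + H_{k-1}(\mathcal{E}_{k-1}(\cdot; x_0^{n-1}) | \mu_\phi),
\end{align*}

where

\[
H_k(\eta | \mu_\phi) = \sum_{a_0^{k-1}} \eta([a_0^{k-1}]) \log \frac{\eta([a_0^{k-1}])}{\mu_\phi([a_0^{k-1}])}
\]

is the $k$-block relative entropy of $\eta$ with respect to $\mu_\phi$. The second term in
(14) is equal to $H_{k-1}(\mathcal{E}_{k-1}(\cdot; x_0^{n-1}) | \mu_\phi)$ because of the
following two facts. First,
$\sum_{a_0 \in A} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) = \mathcal{E}_{k-1}(a_1^{k-1}; x_0^{n-1})$.
This is because $\mathcal{E}_k(\cdot; x_0^{n-1})$ is a locally shift-invariant probability
measure on $A^k$. Second,
$\sum_{a_{k-1} \in A} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) = \mathcal{E}_{k-1}(a_0^{k-2}; x_0^{n-1})$,
because the family $(\mathcal{E}_k(\cdot; x_0^{n-1}))_{k=1,2,\ldots}$ is consistent.

The quantity $|\Delta_k(x_0^{n-1})|$ is bounded above by $(M|A|^k)/n$ according to
[10, formula (4.16)], where $M > 0$ is a constant.

Now we deal with the first term in (13). We first introduce the function

\[
\phi_k(x) := \log \frac{\mu_\phi([x_0^{k-1}])}{\mu_\phi([x_1^{k-1}])}
\]

which is a locally constant function on cylinders of length $k$. It is easy to verify that
$\|\phi - \phi_k\|_\infty \le |\phi|_\theta \theta^k$ (this follows at once from
[?, Prop. 3.2 p. 37]). We get that

\[
-\sum_{a_0^{k-1}} \mathcal{E}_k(a_0^{k-1}; x_0^{n-1}) \log \frac{\mu_\phi([a_0^{k-1}])}{\mu_\phi([a_1^{k-1}])}
= \frac{1}{n} \sum_{i=0}^{n-1} \bigl(-\phi(\sigma^i x)\bigr) + O(\theta^k).
\]

The proof of the lemma is complete.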